Nodes in real-world networks are repeatedly observed to form dense clusters, often referred to as 
communities. Methods to detect these groups of nodes usually maximize an objective function, which 
implicitly contains the definition of a community. Here we analyze a recently proposed measure 
called surprise, which assesses the quality of the partition of a network into communities. In its 
current form, the formulation of surprise is rather difficult to analyze. We therefore develop an 
accurate asymptotic approximation. This allows for the development of an efficient algorithm for 
optimizing surprise. Incidentally, this leads to a straightforward extension of surprise to weighted 
graphs. Additionally, the approximation makes it possible to analyze surprise more closely and 
compare it to other methods, especially modularity. We show that surprise is (nearly) unaffected by 
the well known resolution limit, a particular problem for modularity. However, surprise may tend to 
overestimate the number of communities, whereas they may be underestimated by modularity. In 
short, surprise works well in the limit of many small communities, whereas modularity works better 
in the limit of few large communities. In this sense, surprise is more discriminative than modularity, 
and may find communities where modularity fails to discern any structure. 


I. INTRODUCTION 

Networks are often used as a model to describe interactions among components
of a system [1, 2]. In its simplest form, a network is composed of a set of
vertices (also called nodes) and a set of edges connecting them. Many
real-world systems can be reduced to this scheme, such as social networks
establishing relations among individuals, proteins interacting within the cell
or roads connecting different cities [3]. What caught the interest of the
scientific community was that most of these real networks share high-order
structural patterns and dynamics, such as a wide heterogeneity in the number
of neighbors of a node, the presence of many triangles or a very low network
diameter [4, 5]. Another feature observed in real networks is the presence of
densely connected groups of nodes, known as communities [6]. Nodes in the same
group usually share similar characteristics or functions and, therefore,
methods to detect communities in networks are of much interest across
different fields [7-12].

Researchers have proposed numerous strategies to detect the community
structure of a network [6, 13-15]. Ultimately, most methods optimize a given
objective function to find a partition into communities. This function
contains, either explicitly or implicitly, its own definition of a community.
Modularity [16] has been, since its inception, the most extensively used
measure for community detection. It belongs to a wider class of functions in
which communities are defined by Potts model spin states and the quality of
the partition is given by the energy of the system [17, 18]. Although this
approach based on statistical mechanics may be appealing, empirical evidence
shows that in many cases these methods are unable to capture the expected
communities of the network [15, 19-22]. In fact, numerous studies have pointed
out strong theoretical limitations of modularity approaches for community
detection [23-29].

A proposed measure based on classical probability, called surprise [30], has
been shown to systematically outperform modularity-based methods on different
benchmarks [15, 21]. Here we demonstrate how surprise can be expressed under
an information-theoretic framework, by examining its asymptotic formulation.
In particular, we describe surprise in terms of the Kullback-Leibler (KL)
divergence [31]. This asymptotic formulation allows us to develop, for the
first time, an efficient surprise maximization algorithm. Incidentally, this
also points to a straightforward extension of surprise to weighted graphs.
Additionally, this enables a better analysis of its performance, and allows an
analytic comparison to other methods.

In particular, we compare surprise to a modularity model and the recently
introduced measure of significance, which also detects communities based on
the KL divergence [22]. We show that surprise is more discriminative than
modularity using an Erdős-Rényi (ER) null model, and that significance and
surprise behave relatively similarly. Additionally, we analyze the limitations
of community detection, most notably the resolution limit [23] and the
detectability threshold [32]. We show that surprise is (nearly) unaffected by
the resolution limit, and works well in the limit of a large number of
communities with fixed community sizes. However, in the limit of large
community sizes with a fixed number of communities, surprise works worse than
ER modularity, as it tends to find smaller subgraphs within those larger
communities.

Apart from the choice of the null model, a key component in community
detection is how the difference between the actual community structure and
the null model is quantified. Relying on the KL divergence to measure such
difference results in more discriminative methods. We believe that this fact
can improve current and future community detection strategies.

Graph variables
  $n$                                              Number of nodes
  $m$                                              Number of edges
  $M = \binom{n}{2}$                               Number of possible edges
  $p = \frac{m}{M}$                                Density

Community variables
  $n_c$                                            Number of nodes in community $c$
  $m_c$                                            Number of edges in community $c$
  $\langle m_c \rangle$                            Expected number of edges in community $c$
  $p_c = \frac{m_c}{\binom{n_c}{2}}$               Density of community $c$

Partition variables
  $m_{\text{int}} = \sum_c m_c$                    Total internal edges
  $M_{\text{int}} = \sum_c \binom{n_c}{2}$         Total possible internal edges
  $q = \frac{m_{\text{int}}}{m}$                   Fraction of internal edges
  $\langle q \rangle = \frac{M_{\text{int}}}{M}$   Expected fraction of internal edges

TABLE I. Variables.


II. SURPRISE 


In general, we denote a graph by $G = (V, E)$ consisting of nodes
$V = \{1, \ldots, n\}$ and edges $E \subseteq V \times V$, which has
$n = |V|$ nodes and $m = |E|$ links. The total number of possible links is
denoted by $M = \binom{n}{2}$, and the ratio of present links
$p = \frac{m}{M}$ is known as the density of the graph.

The general aim is to find a good partition
$\mathcal{V} = \{V_1, V_2, \ldots, V_r\}$ of the graph, where each
$V_c \subseteq V$ is a set of nodes, which we call a community. Such
communities are non-overlapping (i.e. $V_c \cap V_d = \emptyset$ for all
$c \neq d$) and cover all the nodes (i.e. $\bigcup_c V_c = V$). Each community
consists of $n_c = |V_c|$ nodes and contains $m_c = |E_c|$ edges. Obviously
then $\sum_c n_c = n$, but the total number of internal edges
$m_{\text{int}} = \sum_c m_c$ is smaller than the total number of edges so
that $m_{\text{int}} \leq m$. An overview of the relevant variables is
provided in Table I.

Surprise is a statistical approach to assess the quality of a partition into
communities. Given a graph with $n$ nodes, there are $\binom{M}{m}$ possible
ways of drawing $m$ edges. Out of those, there are
$M_{\text{int}} = \sum_c \binom{n_c}{2}$ possible ways of drawing an internal
edge. Surprise is then defined as the (minus logarithm of the) probability of
observing at least $m_{\text{int}}$ successes (internal edges) in $m$ draws
without replacement from a finite population of size $M$ containing exactly
$M_{\text{int}}$ possible successes [30, 33]:

$$\mathcal{S}(\mathcal{V}) = -\log \sum_{i = m_{\text{int}}}^{\min(m, M_{\text{int}})} \frac{\binom{M_{\text{int}}}{i} \binom{M - M_{\text{int}}}{m - i}}{\binom{M}{m}}, \tag{1}$$

which derives from the hypergeometric distribution.
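
As a minimal sketch of the definition in Eq. (1), the hypergeometric tail can be evaluated with exact integer arithmetic for small graphs. The function name `exact_surprise` and the toy graph are illustrative assumptions, not part of the authors' implementation.

```python
import math

def exact_surprise(m, M, m_int, M_int):
    """Eq. (1): -log P[X >= m_int] for a hypergeometric X.

    Assumes m_int <= min(m, M_int), so that the tail is non-empty.
    """
    upper = min(m, M_int)
    # Count the favourable draws exactly with Python integers.
    favourable = sum(
        math.comb(M_int, i) * math.comb(M - M_int, m - i)
        for i in range(m_int, upper + 1)
    )
    total = math.comb(M, m)
    return math.log(total) - math.log(favourable)

# Toy graph (hypothetical): 4 nodes, 2 edges, communities {1, 2} and {3, 4},
# with both edges internal.
M = math.comb(4, 2)                          # 6 possible edges
M_int = math.comb(2, 2) + math.comb(2, 2)    # 2 possible internal edges
print(exact_surprise(m=2, M=M, m_int=2, M_int=M_int))  # equals log(15)
```

The binomial coefficients grow very quickly, which is one way to see why this exact form becomes impractical for large graphs.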


1. Asymptotic formulation 

However, this formulation presents some difficulties. It is not
straightforward to work with, nor is it simple to implement in an
optimization procedure, mainly due to numerical computational problems. Since
we are usually interested in relatively large graphs, an asymptotic
approximation may provide a good alternative. The asymptotic expansion we
consider here assumes that the graph grows, but that the relative number of
internal edges $q = \frac{m_{\text{int}}}{m}$ and the relative number of
expected internal edges $\langle q \rangle = \frac{M_{\text{int}}}{M}$ remain
fixed. By only considering the dominant term, we obtain a simple and elegant
approximation (see Appendix A)

$$\mathcal{S}(\mathcal{V}) \approx m D(q \,\|\, \langle q \rangle), \tag{2}$$

where $D(x \,\|\, y)$ is the KL divergence

$$D(x \,\|\, y) = x \log \frac{x}{y} + (1 - x) \log \frac{1 - x}{1 - y}. \tag{3}$$

The KL divergence measures the distance between two probability distributions
(although it is not a proper metric), in this case the Bernoulli probability
distributions $(x, 1 - x)$ and $(y, 1 - y)$. Notice that, in general,
$D(x \,\|\, y) \neq D(y \,\|\, x)$. In this case, $q$ and
$\langle q \rangle$ denote the probability that a link lies (or is expected
to lie) within a community. Whenever $q = \langle q \rangle$, we have that
$D(q \,\|\, \langle q \rangle) = 0$ and, otherwise,
$D(q \,\|\, \langle q \rangle) > 0$. Since we are looking for relatively
dense communities, we generally have $q > \langle q \rangle$.
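
The asymptotic form in Eqs. (2) and (3) needs only four numbers. The following sketch (function names and the example values are hypothetical) evaluates it and illustrates the asymmetry of the KL divergence.

```python
import math

def kl_bernoulli(x, y):
    """Eq. (3): D(x || y) for Bernoulli(x) and Bernoulli(y), with 0 < y < 1."""
    total = 0.0
    if x > 0:
        total += x * math.log(x / y)
    if x < 1:
        total += (1 - x) * math.log((1 - x) / (1 - y))
    return total

def asymptotic_surprise(m, M, m_int, M_int):
    """Eq. (2): dominant term m * D(q || <q>) of surprise."""
    q = m_int / m        # observed fraction of internal edges
    q_exp = M_int / M    # expected fraction of internal edges
    return m * kl_bernoulli(q, q_exp)

# Hypothetical partition: 1000 edges, 600 of them internal, with 10% of all
# node pairs lying inside communities.
print(asymptotic_surprise(m=1000, M=50_000, m_int=600, M_int=5_000))
# Swapping the arguments changes the value: the divergence is not symmetric.
print(kl_bernoulli(0.6, 0.1), kl_bernoulli(0.1, 0.6))
```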

The original formulation of surprise in Eq. (1), based on a hypergeometric
distribution, can be accurately approximated by a binomial distribution. The
only difference between both approaches is that in the former links are drawn
without replacement. Consider again $q = \frac{m_{\text{int}}}{m}$, the
fraction of internal edges in the partition, and
$\langle q \rangle = \frac{M_{\text{int}}}{M}$, the expected fraction of
internal edges. The binomial formulation of surprise would then be

$$\mathcal{S}(\mathcal{V}) = -\log \sum_{i = m_{\text{int}}}^{\min(m, M_{\text{int}})} \binom{m}{i} \langle q \rangle^{i} (1 - \langle q \rangle)^{m - i}. \tag{4}$$

The asymptotic development for the dominant term of binomial surprise is
simpler. We use Stirling's approximation,

$$\binom{m}{m_{\text{int}}} \approx e^{m H(q)},$$

where $H(x) = -x \log x - (1 - x) \log(1 - x)$ is the (binary) entropy and we
use that $m_{\text{int}} = qm$. Binomial surprise then becomes

$$\mathcal{S}(\mathcal{V}) \approx -m \left[ H(q) + q \log \langle q \rangle + (1 - q) \log(1 - \langle q \rangle) \right] = m D(q \,\|\, \langle q \rangle).$$


Thus, as expected, for large sparse networks the differ¬ 
ence between drawing with or without replacement is 
negligible. 
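
A numerical check can make the quality of the dominant-term approximation concrete. The sketch below evaluates the binomial tail of Eq. (4) with a log-sum-exp and compares it to $m D(q \,\|\, \langle q \rangle)$. It assumes $M_{\text{int}} \geq m$, so that the sum runs up to $m$; the function names and the values $q = 0.5$, $\langle q \rangle = 0.1$ are illustrative choices.

```python
import math

def kl(x, y):
    """D(x || y) for Bernoulli distributions with 0 < x < 1 and 0 < y < 1."""
    return x * math.log(x / y) + (1 - x) * math.log((1 - x) / (1 - y))

def binomial_surprise(m, m_int, q_exp):
    """-log P[X >= m_int] for X ~ Binomial(m, q_exp), computed in log space."""
    logs = [
        math.lgamma(m + 1) - math.lgamma(i + 1) - math.lgamma(m - i + 1)
        + i * math.log(q_exp) + (m - i) * math.log1p(-q_exp)
        for i in range(m_int, m + 1)
    ]
    top = max(logs)
    return -(top + math.log(sum(math.exp(v - top) for v in logs)))

def ratio(m, q=0.5, q_exp=0.1):
    """Ratio of binomial surprise to its asymptotic approximation."""
    return binomial_surprise(m, round(q * m), q_exp) / (m * kl(q, q_exp))

for m in (100, 1000, 10000):
    print(m, ratio(m))  # the ratio moves towards 1 as m grows
```

The remaining gap grows only logarithmically in $m$, which is why it vanishes relative to the dominant term.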


2. Algorithm 

Evaluating the quality of a partition using surprise shows excellent results
in standard benchmarks. In fact, it has been shown that a meta-algorithm that
selects the partition with the highest surprise, from a set of candidate
solutions provided by the best community detection algorithms, outperforms
any single algorithm by itself [15, 21, 34]. However, no algorithm for
directly optimizing surprise has been developed yet.

The asymptotic formulation allows a straightforward algorithmic
implementation, in a similar fashion as the Louvain algorithm [35], which was
initially designed to optimize modularity. The Louvain algorithm consists of
two steps. First, we move individual nodes from one community to another so
as to greedily improve surprise. Second, when surprise can no longer be
improved by moving individual nodes, we aggregate the graph, and repeat the
procedure on the aggregated graph.

The aggregation of the graph is simply the contraction of all nodes within a
community to a single "community node". The multiplicities of the edges are
kept as weighted edges, so that
$w_{cd} = \sum_{i \in V_c, j \in V_d} w_{ij}$ denotes the weight between the
new nodes $c$ and $d$ in the aggregate graph, where initially
$w_{ij} = A_{ij}$. Here, $A_{ij} = 1$ if there is an edge between $i$ and
$j$, and 0 otherwise. We additionally need a node size to keep track of the
total size of the communities, similar to [29]. Initially we set this node
size to $n_i = 1$, and upon aggregation the node size
$n_c = \sum_{i \in V_c} n_i$ is set to the total number of nodes within the
community.

One of the essential elements of the Louvain algorithm is that the surprise
of the partition on the aggregated graph is the same as the surprise of the
original partition on the original graph. This ensures that moving a node in
the aggregated graph corresponds to moving a whole community in the original
graph. In other words, if $\mathcal{V}$ denotes the partition of $G$ and
$\mathcal{V}' = \{1, 2, \ldots, r\}$ denotes the default partition of the
aggregated graph $G'$, then
$\mathcal{S}(\mathcal{V}, G) = \mathcal{S}(\mathcal{V}', G')$. For
calculating surprise in the aggregated graph, we then use
$m_c = \sum_{i, j \in V'_c} w_{ij}$ as the internal weight,
$n_c = \sum_{i \in V'_c} n_i$ as the community size and $n = \sum_c n_c$.
With the other definitions remaining the same, it is straightforward to see
that $\mathcal{S}(\mathcal{V}, G) = \mathcal{S}(\mathcal{V}', G')$. Notice
that the same formulations can also be applied to the original graph, when
using $w_{ij} = A_{ij}$ and $n_i = 1$.
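
The aggregation step and the invariance of $m_{\text{int}}$ and $M_{\text{int}}$ can be sketched as follows. The dictionary-based graph representation, the function names and the example graph (two triangles joined by one edge) are assumptions made for illustration only.

```python
import math
from collections import defaultdict

def aggregate(weights, sizes, membership):
    """Contract each community to one node, keeping edge weights and node sizes.

    weights: {(i, j): w_ij}, one entry per undirected edge
    sizes: {i: n_i}; membership: {i: community of i}
    """
    agg_w = defaultdict(float)
    agg_n = defaultdict(int)
    for (i, j), w in weights.items():
        c, d = membership[i], membership[j]
        agg_w[(min(c, d), max(c, d))] += w   # (c, c) stores the internal weight
    for i, n_i in sizes.items():
        agg_n[membership[i]] += n_i
    return dict(agg_w), dict(agg_n)

def internal_stats(weights, sizes, membership):
    """Return (m_int, M_int) from edge weights and node sizes."""
    m_int = sum(w for (i, j), w in weights.items()
                if membership[i] == membership[j])
    n_c = defaultdict(int)
    for i, n_i in sizes.items():
        n_c[membership[i]] += n_i
    M_int = sum(math.comb(size, 2) for size in n_c.values())
    return m_int, M_int

weights = {(0, 1): 1, (0, 2): 1, (1, 2): 1,
           (3, 4): 1, (3, 5): 1, (4, 5): 1, (2, 3): 1}
sizes = {i: 1 for i in range(6)}
membership = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}

agg_w, agg_n = aggregate(weights, sizes, membership)
singletons = {c: c for c in agg_n}   # default partition of the aggregate graph
print(internal_stats(weights, sizes, membership))
print(internal_stats(agg_w, agg_n, singletons))
```

Both calls return the same pair, so $q$ and $\langle q \rangle$, and hence the asymptotic surprise, agree between the original and the aggregated graph in this example.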

Using this formulation of the aggregate graph, it is quite straightforward to
calculate the improvement in surprise when moving a node. Before we move node
$i$ from community $c$ to community $d$, assume we have $m_{\text{int}}$
internal edges, and $M_{\text{int}}$ possible internal edges. The total
weight between node $i$ and community $c$ is
$w_{ic} = \sum_{j \neq i \in V_c} w_{ij}$, and similarly between node $i$ and
community $d$, with a possible self-loop of $w_{ii}$. The new internal weight
after moving node $i$ from community $c$ to community $d$ is then
$m'_{\text{int}} = m_{\text{int}} - w_{ic} + w_{id}$. The change in
$M_{\text{int}} = \sum_c \binom{n_c}{2}$ is slightly more complicated. After
the move, we obtain $n'_c = n_c - n_i$ and $n'_d = n_d + n_i$, so that
$M'_{\text{int}} = M_{\text{int}} + n_i (n_i + n_d - n_c)$. Finally, we use
$q' = \frac{m'_{\text{int}}}{m}$ and
$\langle q' \rangle = \frac{M'_{\text{int}}}{M}$. The improvement in surprise
for moving node $i$ from community $c$ to community $d$ is then simply

$$\Delta \mathcal{S}(\sigma_i = c \mapsto d) = m \left( D(q' \,\|\, \langle q' \rangle) - D(q \,\|\, \langle q \rangle) \right), \tag{6}$$

where we denote the community of node $i$ by $\sigma_i$ (i.e. $\sigma_i = c$
if $i \in V_c$). The algorithm can then be simply summarized as follows:

function OptimizeSurprise(Graph G)
    while improvement do
        σ_i ← i for i = 1, ..., |V(G)|.          ▷ Initial partition
        while improvement do
            for random v ∈ V(G) do
                σ_v ← arg max_d ΔS(σ_v = c ↦ d)
            end for
        end while
        σ'_i = σ_{σ'_i}.                         ▷ Community in original graph.
        G ← AggregateGraph(G)
    end while
    return σ'
end function
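
The inner step of the procedure, evaluating the gain of a single move, can be sketched from the update rules for $m_{\text{int}}$ and $M_{\text{int}}$ given above. In this sketch the gain is taken as the new value minus the old value, so that a positive result is an improvement; the function names and the example graph (two triangles joined by one edge) are hypothetical.

```python
import math

def kl(x, y):
    """D(x || y) for Bernoulli distributions, with 0 < y < 1."""
    total = 0.0
    if x > 0:
        total += x * math.log(x / y)
    if x < 1:
        total += (1 - x) * math.log((1 - x) / (1 - y))
    return total

def delta_surprise(m, M, m_int, M_int, w_ic, w_id, n_i, n_c, n_d):
    """Gain in asymptotic surprise when node i moves from community c to d.

    n_c and n_d are the community sizes before the move.
    """
    new_m_int = m_int - w_ic + w_id
    new_M_int = M_int + n_i * (n_i + n_d - n_c)
    old = kl(m_int / m, M_int / M)
    new = kl(new_m_int / m, new_M_int / M)
    return m * (new - old)

# Two triangles {0, 1, 2} and {3, 4, 5} joined by the edge (2, 3).
# Start from the partition {0, 1}, {2, 3, 4, 5} and move node 2 to {0, 1}.
gain = delta_surprise(m=7, M=15, m_int=5, M_int=7,
                      w_ic=1, w_id=2, n_i=1, n_c=4, n_d=2)
print(gain)  # positive: the move recovers the two triangles
```

Because only the two affected communities enter the update, each candidate move costs constant time once $w_{ic}$ and $w_{id}$ are known.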

Incidentally, our formulation for surprise for the aggregated graph yields a
weighted version of surprise. While keeping the same formulation of surprise
as in Eq. (2), we only need to change the definitions of $q$ and
$\langle q \rangle$. Then $q = \frac{w_{\text{int}}}{w}$, where
$w_{\text{int}} = \sum_c \sum_{i, j \in V_c} w_{ij}$ is the internal weight
and $w = \sum_{ij} w_{ij}$ is the total weight. Assuming then a uniform
distribution of weights across the graph in the random graph, the expected
weight of an edge would be $\langle w \rangle$, which would not show too much
deviation. The total possible internal weight is then
$\langle w \rangle M_{\text{int}}$, while the total possible weight would be
$\langle w \rangle M$. Hence,
$\langle q \rangle = \frac{M_{\text{int}}}{M}$ remains unchanged.

We provide an open-source, fast and flexible C++ implementation of the
optimization of surprise using the Louvain algorithm. It is suitable for use
in Python using the igraph package. This implementation is available from
GitHub as louvain-igraph (footnote 1) and from PyPI (footnote 2) simply as
louvain, and implements various other methods as well.

1 https://github.com/vtraag/louvain-igraph

2 https://pypi.python.org/pypi/louvain/


III. COMPARISON


We now review how surprise compares to some closely related methods. There
are many other methods still, and we cannot do all of them justice here. For
a more comprehensive review, please refer to [6, 36].


A. Modularity


Although relatively recent, modularity has rapidly become an extremely
popular method for community detection. The general idea is that we want to
find a partition such that the communities have more internal links than
expected. In its original formulation, modularity assumes a null model in
which the degree $k_i$ of a node is fixed [16], the so-called configuration
model [37]. This implies that the expected number of internal edges is

$$\langle m_c \rangle = \frac{K_c^2}{4m}, \tag{7}$$

where $K_c = \sum_{i \in V_c} k_i$ is the total degree of nodes in community
$c$. Modularity compares this value to the observed number of edges $m_c$
within the community, and simply sums the difference. The measure is usually
normalized by the total number of edges, obtaining

$$Q_{\text{CM}}(\mathcal{V}) = \frac{1}{m} \sum_c \left( m_c - \frac{K_c^2}{4m} \right). \tag{8}$$

This random graph null model represents the configuration model, where the
degree dependency of the nodes is taken into account. We therefore refer to
it as the CM modularity.

Alternative derivations of modularity have been proposed, some of them with
different null models [17]. Surprise implicitly assumes a null model in which
every edge appears with the same probability $p$, as in an ER random graph.
The number of expected edges in a community of size $n_c$ is thus

$$\langle m_c \rangle = p \binom{n_c}{2}. \tag{9}$$

Plugging this null model into modularity, we obtain its ER version [17]

$$Q_{\text{ER}}(\mathcal{V}) = \frac{1}{m} \sum_c \left( m_c - p \binom{n_c}{2} \right). \tag{10}$$

There is an interesting relationship between this ER modularity and surprise.
Given that $p = m/M$, we can write

$$Q_{\text{ER}}(\mathcal{V}) = \sum_c \frac{m_c}{m} - \sum_c \frac{\binom{n_c}{2}}{M} \tag{11}$$

$$= q - \langle q \rangle. \tag{12}$$

By Pinsker's inequality this is related to the KL divergence as

$$q - \langle q \rangle \leq \sqrt{\tfrac{1}{2} D(q \,\|\, \langle q \rangle)}, \tag{13}$$

and, therefore,

$$\mathcal{S}(\mathcal{V}) = m D(q \,\|\, \langle q \rangle) \geq 2m \, Q_{\text{ER}}(\mathcal{V})^2. \tag{14}$$

This implies that whenever surprise is low, modularity is also low. Whenever
a good partition (in the sense of being different from random) cannot be
found by surprise, it is unlikely that modularity will be able to find one.
While Eq. (14) is sometimes tight, on some partitions surprise can be much
larger than modularity, making it more likely to be selected as optimal while
escaping the scrutiny of modularity optimization. In this sense, surprise is
more discriminative than modularity.

To illustrate this, consider a one-dimensional circular lattice with
neighbors within distance 3. In other words, node $i$ is connected to nodes
$i - 3 \bmod n$ to $i + 3 \bmod n$ (excluding the self-loop). We create
partitions consisting of $r$ communities by grouping consecutive nodes such
that $n/r$ nodes are in the same community. The ER modularity reaches its
maximum with just a few communities (Fig. 1). Modularity indeed often detects
only few communities, part of the problem of its resolution limit
[23, 24, 29]. Both surprise and significance (see next section) still
increase whereas ER modularity is already decreasing again. ER modularity may
not be able to discern partitions with many communities, whereas surprise and
significance can. On the other hand, when surprise goes to 0 we see that ER
modularity indeed also goes to 0, showing the upper bound provided by
surprise.
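
The lattice example can be reproduced numerically from $q$ and $\langle q \rangle$ alone. The sketch below is an illustration under stated assumptions: the lattice size $n = 720$ is a hypothetical choice (the text does not fix it), only values of $r$ that divide $n$ with at least 4 nodes per community are scanned, and surprise is evaluated through its asymptotic form.

```python
import math

def kl(x, y):
    """D(x || y) for Bernoulli distributions with 0 < x < 1 and 0 < y < 1."""
    return x * math.log(x / y) + (1 - x) * math.log((1 - x) / (1 - y))

def lattice_scores(n, r, reach=3):
    """Return (asymptotic surprise, ER modularity) for r blocks of consecutive nodes."""
    s = n // r                      # nodes per community
    m = reach * n                   # every node has 2 * reach neighbors
    # Inside a block of s consecutive nodes there are s - d pairs at distance d.
    m_int = r * sum(max(s - d, 0) for d in range(1, reach + 1))
    M, M_int = math.comb(n, 2), r * math.comb(s, 2)
    q, q_exp = m_int / m, M_int / M
    return m * kl(q, q_exp), q - q_exp

n = 720
candidates = [r for r in range(2, n // 4 + 1) if n % r == 0]
best_surprise_r = max(candidates, key=lambda r: lattice_scores(n, r)[0])
best_modularity_r = max(candidates, key=lambda r: lattice_scores(n, r)[1])
print(best_surprise_r, best_modularity_r)
```

In this setting surprise peaks at a larger number of communities than ER modularity, and for every scanned $r$ the surprise stays above $2m Q_{\text{ER}}^2$, in line with the bound in Eq. (14).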


B. Significance 

Significance [22], a recently introduced objective function to evaluate
community structure quality, presents an approach similar to surprise.
Surprise describes how likely it is to observe $m_{\text{int}}$ internal
links in communities. Significance, on the other hand, looks at how likely
such dense communities appear in a random graph. Comparing the two measures
is not immediately straightforward. On the one hand, if dense communities are
unlikely to be present in a random graph (high significance), then a
community is also unlikely to contain many links at random (high surprise).
On the other hand, if a community is unlikely to contain many links at random
(high surprise), perhaps there are still communities elsewhere in the random
graph that contain so many links. Therefore we should compare the two more
formally to make more exact statements.

Asymptotically, significance is defined as

$$\mathcal{Z}(\mathcal{V}) = \sum_c \binom{n_c}{2} D(p_c \,\|\, p), \tag{15}$$

where $p_c = m_c / \binom{n_c}{2}$ is the density of community $c$, $p$ is
the density of the graph and $D(x \,\|\, y)$ is again the KL divergence.
Significance also showed a great performance in standard benchmarks, and
helped to determine the proper scale of resolution in multi-resolution
methods [22].

FIG. 1. (Color online) Comparison of bounds. We show the quality of
partitions of a lattice in $r$ communities. ER modularity quickly reaches a
maximum for few communities (we show $2m Q_{\text{ER}}^2$ rather than
$Q_{\text{ER}}$ for comparison). Both significance and surprise reach a
maximum for many more communities. This illustrates that ER modularity is
simply unable to discern partitions with such a high number of communities.
The inset shows the same data, but on a logarithmic x-axis.

Both surprise and significance are based on the KL divergence to compare the
actual number of internal edges to the expected one. However, they do so in
different ways. Whereas surprise compares such difference using global
quantities, $q$ and $\langle q \rangle$, significance compares each community
density $p_c$ to the average graph density $p$.

This implies, among other things, that only significance is affected by the
actual distribution of edges between communities. In particular, moving edges
from a denser community (with a high $p_c$) to a sparse community (with a low
$p_c$) generally decreases the value of significance. This means that if all
communities have the same density, ceteris paribus, significance is minimal.
This intuition is confirmed by convexity of the KL divergence (see
Appendix B), so that significance is lower-bounded by

$$\mathcal{Z}(\mathcal{V}) \geq M_{\text{int}} D(\langle p_c \rangle \,\|\, p) \tag{16}$$

with the weighted average density

$$\langle p_c \rangle = \sum_c \frac{\binom{n_c}{2}}{M_{\text{int}}} p_c = \frac{m_{\text{int}}}{M_{\text{int}}} = p \frac{q}{\langle q \rangle}. \tag{17}$$


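Convexity of the KL divergence also shows that

$$\mathcal{Z}(\mathcal{V}) \geq \mathcal{S}(\mathcal{V}) \tag{18}$$

whenever $\langle q \rangle < p$ (see Appendix B). To gain more insight, we
can slightly rewrite $\langle q \rangle$ to obtain

$$\langle q \rangle \approx \frac{\sum_c n_c^2}{n^2} = \frac{1}{r} \frac{\langle n_c^2 \rangle}{\langle n_c \rangle^2}. \tag{19}$$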


Then, in general, $\langle q \rangle$ will be inversely proportional to the
number of communities, and increases with the variance of the community sizes
$n_c$. Hence, if the number of communities is relatively large (small
$\langle q \rangle$), or the network is relatively dense (large $p$),
significance is more discriminative than surprise. However, in the case that
$\langle q \rangle > p$, surprise can be more discriminative than
significance (see Appendix B). Notice that if $\langle q \rangle = p$, then
$\langle p_c \rangle = q$, so that
$D(\langle p_c \rangle \,\|\, p) = D(q \,\|\, \langle q \rangle)$ and
significance and surprise values are close to each other. Therefore, the two
measures are expected to behave relatively similarly, especially for
$\langle q \rangle \approx p$. Nonetheless, in dense networks with many
communities significance would be more discriminative, whereas for fewer
communities or sparse graphs, surprise would show a better performance.
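
The relation between the two measures can be checked on a small example. The sketch below computes asymptotic significance as in Eq. (15), the lower bound of Eq. (16) and asymptotic surprise for a hypothetical graph with 6 nodes and 7 edges, split into a triangle (3 internal edges) and a 3-node path (2 internal edges). Function and variable names are illustrative.

```python
import math

def kl(x, y):
    """D(x || y) for Bernoulli distributions, with 0 < y < 1."""
    total = 0.0
    if x > 0:
        total += x * math.log(x / y)
    if x < 1:
        total += (1 - x) * math.log((1 - x) / (1 - y))
    return total

def significance(communities, p):
    """Eq. (15): sum over communities of C(n_c, 2) * D(p_c || p)."""
    total = 0.0
    for n_c, m_c in communities:
        pairs = math.comb(n_c, 2)
        if pairs > 0:
            total += pairs * kl(m_c / pairs, p)
    return total

n, m = 6, 7
communities = [(3, 3), (3, 2)]            # (n_c, m_c) per community
M = math.comb(n, 2)
p = m / M
M_int = sum(math.comb(n_c, 2) for n_c, _ in communities)
m_int = sum(m_c for _, m_c in communities)

Z = significance(communities, p)
lower = M_int * kl(m_int / M_int, p)      # Eq. (16), with <p_c> = m_int / M_int
S = m * kl(m_int / m, M_int / M)          # Eq. (2)
print(Z, lower, S)
```

Here $\langle q \rangle = 0.4$ is smaller than $p = 7/15$, and the computed significance exceeds both its lower bound and the surprise.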


IV. LIMITATIONS 

Although modularity was lauded for its ability to detect communities without
specifying the number of communities, this came at a certain price. One of
the best known problems in community detection is the resolution limit [23],
which prevents modularity from detecting small communities. It thus tends to
underestimate the number of communities in a graph, and lumps together
several smaller communities in larger communities. Moreover, this depends on
the scale of the graph, so that modularity has a problem of scale. It was
shown that this is the case for both ER and CM modularity, and that other
null models also suffer from the same drawbacks [24]. In fact, most methods
are expected to suffer from this problem, and only few methods are able to
avoid it completely [29]. Additionally, there is also a lower counterpart to
the resolution limit, leading to unnecessary splitting of cliques [38, 39].
Finally, modularity is also myopic, cutting across long dendrites [40].
Another fundamental limit in community detection is called the detectability
threshold [32], which also has some counter-intuitive effects [41]. This
prevents any method from correctly detecting communities beyond this
threshold. The asymptotic formulation of surprise enables us to understand
better how it performs with respect to these limitations.


A. Resolution limit 

The resolution limit is traditionally studied through the ring of cliques
[13]. This is a graph consisting of $r$ cliques (i.e. completely connected
subgraphs) connected only by one link between two cliques to form a ring.
This is one of the most modular structures possible: we cannot delete more
than one link between communities and still keep it connected, while we
cannot add any more links within the cliques. When a method starts to join
the cliques, it can no longer detect the smaller cliques, and so, a fortiori,
cannot detect less well-defined subgraphs either. We denote by $q_1$ (and
$\langle q_1 \rangle$) the (expected) proportion of edges within communities
for the partition where each community contains a single clique and use $q_2$
(and $\langle q_2 \rangle$) for the partition where each community contains
two cliques. To facilitate the derivation, we work with self-loops (and
directed edges), so that the number of edges within each clique is $n_c^2$.
Let $r$ denote the number of cliques. Then obviously $n = r n_c$ and
$m = r n_c^2 + 2r$. For the partition of each clique in its own community we
then obtain

$$q_1 = \frac{r n_c^2}{r n_c^2 + 2r}, \qquad \langle q_1 \rangle = \frac{1}{r}, \tag{20}$$

while for the partition with 2 cliques merged we obtain

$$q_2 = \frac{r n_c^2 + r}{r n_c^2 + 2r}, \qquad \langle q_2 \rangle = \frac{2}{r}. \tag{21}$$

Hence, $q_2 = q_1 + \epsilon$ with
$\epsilon = \frac{r}{r n_c^2 + 2r} = \frac{1}{n_c^2 + 2}$ and
$\langle q_2 \rangle = 2 \langle q_1 \rangle$. The difference of surprise,
normalized by $m$, is

$$\Delta S = \frac{\Delta \mathcal{S}}{m} = D(q_2 \,\|\, \langle q_2 \rangle) - D(q_1 \,\|\, \langle q_1 \rangle), \tag{22}$$

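which works out to

$$\Delta S = q_1 \log \frac{q_2 \langle q_1 \rangle}{\langle q_2 \rangle q_1} + (1 - q_1) \log \frac{(1 - q_2)(1 - \langle q_1 \rangle)}{(1 - \langle q_2 \rangle)(1 - q_1)} + \epsilon \log \frac{q_2 (1 - \langle q_2 \rangle)}{\langle q_2 \rangle (1 - q_2)}. \tag{23}$$

Approximating $r - 2 \approx r - 1 \approx r$ we obtain

$$\Delta S \approx -D(q_1 \,\|\, q_2) - q_1 \log 2 + \epsilon \log \frac{r q_2}{2 (1 - q_2)}. \tag{24}$$

Solving for $r$ at the point at which $\Delta S = 0$ yields

$$r = 2 \frac{1 - q_2}{q_2} \exp\left( \frac{D(q_1 \,\|\, q_2)}{\epsilon} \right) 2^{q_1 / \epsilon}, \tag{25}$$

which scales as $r \sim 2^{n_c^2}$, so that for larger $r$ surprise starts to
merge cliques.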

Working out the inequality for both CM and ER modularity we obtain that
$r \sim n_c^2$. Hence, the number of cliques $r$ at which modularity starts
to merge cliques lies considerably lower than for surprise and grows linearly
with the square of community sizes rather than exponentially. So, although
surprise shows a similar problem as modularity, it only starts to show at
really large graphs, so it is unlikely to be a problem in any empirical
graph. Indeed, this demonstrates exactly the key difference between
modularity and surprise: the first is unable to detect relatively small
communities in large graphs, whereas the latter has (nearly) no such
difficulties.
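
The ring-of-cliques argument can be checked numerically. The sketch below evaluates the normalized difference in asymptotic surprise between the merged and the unmerged partition, using the self-loop and directed-edge convention of the text, and compares the sign change with the closed-form estimate of Eq. (25). Function names and the clique size $n_c = 3$ are illustrative choices.

```python
import math

def kl(x, y):
    """D(x || y) for Bernoulli distributions with 0 < x < 1 and 0 < y < 1."""
    return x * math.log(x / y) + (1 - x) * math.log((1 - x) / (1 - y))

def delta_merge(n_c, r):
    """Eq. (22): D(q2 || <q2>) - D(q1 || <q1>) for r > 2 cliques of size n_c."""
    m = r * n_c ** 2 + 2 * r
    q1 = r * n_c ** 2 / m
    q2 = (r * n_c ** 2 + r) / m
    return kl(q2, 2 / r) - kl(q1, 1 / r)

def critical_r(n_c):
    """Eq. (25): approximate number of cliques at which merging starts to pay off."""
    q1 = n_c ** 2 / (n_c ** 2 + 2)
    q2 = (n_c ** 2 + 1) / (n_c ** 2 + 2)
    eps = q2 - q1
    return 2 * (1 - q2) / q2 * math.exp(kl(q1, q2) / eps) * 2 ** (q1 / eps)

print(critical_r(3))                              # estimated crossover for triangles
print(delta_merge(3, 100), delta_merge(3, 1000))  # negative, then positive
```

Even for cliques of only three nodes the crossover lies above a hundred cliques, and it grows roughly as $2^{n_c^2}$ for larger cliques.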


B. Detectability threshold 


In order to study the detectability threshold, we first introduce the planted
partition model. This means that we build a graph such that it will contain a
specified partition: we plant it in the graph. We create $n$ nodes and assign
each node to a certain community. An edge within a community is created with
probability $p_{\text{in}}$, whereas an edge in between two communities is
created with probability $p_{\text{out}}$. We define the probability of an
internal edge $p_{\text{in}}$ and the probability of an external edge
$p_{\text{out}}$ to be respectively

$$p_{\text{in}} = \frac{(1 - \mu) k}{n_c - 1}, \qquad p_{\text{out}} = \frac{\mu k}{n - n_c}, \tag{26}$$

so that the average degree is $k$ and $\mu$ is the probability that an edge
is between communities. When $\mu = 0$ all links are thus placed within the
planted communities, whereas for $\mu = 1$ all links are placed between the
planted communities. Uncovering the planted communities correctly is trivial
for $\mu = 0$ but becomes increasingly more difficult for higher $\mu$. The
average degree within a cluster is $k_{\text{in}} = (1 - \mu) k$ while the
average degree between clusters is $k_{\text{out}} = \mu k$. We denote
community sizes by $n_c$ for the $r$ different communities.
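
A planted partition graph with equally sized communities can be sampled directly from Eq. (26). The sketch below uses only the standard library; the function names, the fixed seed and the parameter values are assumptions for illustration.

```python
import itertools
import random

def planted_probabilities(n, n_c, k, mu):
    """Eq. (26): edge probabilities inside and between communities."""
    p_in = (1 - mu) * k / (n_c - 1)
    p_out = mu * k / (n - n_c)
    return p_in, p_out

def planted_partition(r, n_c, k, mu, seed=0):
    """Sample edges for r communities of n_c nodes each; returns (edges, membership)."""
    rng = random.Random(seed)
    n = r * n_c
    p_in, p_out = planted_probabilities(n, n_c, k, mu)
    membership = [i // n_c for i in range(n)]
    edges = [
        (i, j)
        for i, j in itertools.combinations(range(n), 2)
        if rng.random() < (p_in if membership[i] == membership[j] else p_out)
    ]
    return edges, membership

edges, membership = planted_partition(r=4, n_c=50, k=10, mu=0.1)
external = sum(1 for i, j in edges if membership[i] != membership[j])
print(len(edges), external / len(edges))  # observed mixing, close to mu on average
```

The expected degree of a node is $p_{\text{in}} (n_c - 1) + p_{\text{out}} (n - n_c) = k$, which is the normalization that Eq. (26) encodes.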

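Notice that, most conveniently, $q = 1 - \mu$, while
$\langle q \rangle = \frac{\langle n_c^2 \rangle}{r \langle n_c \rangle^2}$.
We can thus easily calculate $\mathcal{S}_{\text{plt}}$, the surprise for the
planted partition. Since $\mathcal{S} \geq 0$ by definition, communities can
thus only be detected when
$1 - \mu > \frac{\langle n_c^2 \rangle}{r \langle n_c \rangle^2}$. This
yields the rather trivial detectability threshold of

$$\mu < 1 - \frac{\langle n_c^2 \rangle}{r \langle n_c \rangle^2}. \tag{27}$$

In the case of equi-sized communities, this reduces to the familiar trivial
threshold $\mu < \frac{r - 1}{r}$ [19].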

However, due to stochastic fluctuation, the communities already become
ill-defined prior to the threshold. Indeed, $\mathcal{S} = 0$ provides a
rather naive bound, since $\mathcal{S} > 0$ also in random graphs. In
general, $\mathcal{S} = 0$ for both trivial partitions of one large community
and $n$ small communities (since then $q = \langle q \rangle$), so that
optimizing surprise in a random graph will yield some partition with strictly
positive surprise. This implies that at some (lower) critical $\mu^*$ the
community structure is essentially no longer discernible from the community
structure in a random graph. Hence, we should not consider when
$\mathcal{S}_{\text{plt}} > 0$ but when
$\mathcal{S}_{\text{plt}} > \mathcal{S}_{\text{rnd}}$, where
$\mathcal{S}_{\text{rnd}}$ is the surprise attainable in a random graph. We
first examine the case with $r = 2$ and $n_c = n/2$. Previous literature
found a detectability threshold for
$k_{\text{in}} - k_{\text{out}} < \sqrt{k_{\text{in}} + k_{\text{out}}}$
[32, 42, 43]. Beyond this threshold, the optimal bisection becomes
indiscernible from an optimal bisection in a random graph. This threshold
thus coincides with the expected number of internal edges for an optimal
bisection in a random graph. We can use this to calculate
$\mathcal{S}_{\text{rnd}}(2)$, the maximum surprise for a bisection in a
random graph. Let us denote by $q_{\text{rnd}}(2)$ the probability an edge is
within a community in the best bisection for a random graph. Substituting
$k_{\text{in}} = q_{\text{rnd}}(2) k$ and
$k_{\text{out}} = (1 - q_{\text{rnd}}(2)) k$ and solving for
$q_{\text{rnd}}(2)$ yields

$$q_{\text{rnd}}(2) = \frac{1}{2} + \frac{1}{2 \sqrt{k}}. \tag{28}$$

We thus obtain
$\mathcal{S}_{\text{rnd}}(2) = m D(q_{\text{rnd}}(2) \,\|\, \tfrac{1}{2})$
for the maximum surprise for a bisection in a random graph. If
$\mathcal{S}_{\text{plt}}(2) < \mathcal{S}_{\text{rnd}}(2)$ the planted
partition is no longer optimal, and we will likely find an alternative
partition with surprise equal to $\mathcal{S}_{\text{rnd}}(2)$. The threshold
is then $\mu^* = 1 - q_{\text{rnd}}(2)$, congruent with previous results. So,
in general, surprise is expected to show similar behavior concerning the
detectability threshold as other methods.

FIG. 2. (Color online) Limitations on community detection. We construct
graphs with a planted partition, with a probability of an edge between
communities of $\mu = 0.1$. We show the quality ratio between the quality of
the partition found by optimization $\mathcal{S}$ and the quality of the
planted partition $\mathcal{S}_{\text{plt}}$ (and similarly for ER
modularity). Hence, if the quality ratio > 1, the planted partition is no
longer optimal. In the figure on the left we consider the case for fixed
community size $n_c = 10$, but increase the number of communities $r$. The
results show that in this case surprise finds the planted partitions, whereas
ER modularity has more difficulties, in line with our analysis. This is
mostly due to the resolution limit in modularity, which underestimates the
number of communities. In the figure on the right we consider the case of a
fixed number of communities $r = 2$ but an increasing community size $n_c$.
In this case, surprise quickly finds other partitions than the planted
partition, whereas modularity remains closer to the planted partition,
consistent with our analysis. This is mostly because surprise tends to find
substructure in the rather large communities arising from random
fluctuations, which thus overestimates the number of communities. However,
modularity also has some difficulty in finding the planted partition. This
figure shows the average over 5 replications for each setting, and the error
bars show the standard deviation.
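
For the bisection case, the threshold follows from solving $k_{\text{in}} - k_{\text{out}} = \sqrt{k_{\text{in}} + k_{\text{out}}}$ with $k_{\text{in}} = q k$ and $k_{\text{out}} = (1 - q) k$. The sketch below evaluates the resulting $q_{\text{rnd}}(2)$ and $\mu^*$ and compares the planted and random surprise per edge; function names and the value $k = 16$ are illustrative.

```python
import math

def kl(x, y):
    """D(x || y) for Bernoulli distributions with 0 < x < 1 and 0 < y < 1."""
    return x * math.log(x / y) + (1 - x) * math.log((1 - x) / (1 - y))

def q_rnd_bisection(k):
    """Internal edge fraction of the best bisection of a random graph."""
    return 0.5 + 0.5 / math.sqrt(k)

def critical_mu(k):
    """Mixing above which a planted bisection is no better than a random one."""
    return 1 - q_rnd_bisection(k)

k = 16
mu_star = critical_mu(k)
print(q_rnd_bisection(k), mu_star)

# Surprise per edge of the planted bisection, D(1 - mu || 1/2), against the
# value attainable in a random graph, D(q_rnd(2) || 1/2).
for mu in (0.30, 0.45):
    print(mu, kl(1 - mu, 0.5) > kl(q_rnd_bisection(k), 0.5))
```

For $k = 16$ the planted bisection has the higher surprise at $\mu = 0.30$ but not at $\mu = 0.45$, on either side of $\mu^*$.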

However, this analysis restricts itself to finding the same number of communities (i.e. two in this case), while it is possible that an optimal partition would split the graph into more communities. In other words, we need to compare the surprise of the planted partition to the maximum surprise in a random graph, while allowing more than two communities. Although the expected value of the maximum surprise in a random graph is not easy to find, a random graph is likely to contain a near perfect matching. Using that, we can derive a lower bound on the expected surprise in a random graph. In such a perfect matching there are $r = \frac{n}{2}$ communities which contain 1 link each. For a graph that contains $m = nk$ edges, we then have $q = \frac{1}{2k}$ while $\langle q \rangle = \frac{2}{n}$. This leads to a surprise of approximately $\mathcal{S}_{rnd}(\frac{n}{2}) \approx \frac{n}{2} \log \frac{n}{4k}$. Hence, whenever we obtain that $\mathcal{S}_{plt} < \mathcal{S}_{rnd}(\frac{n}{2})$ optimization should find another partition than the planted one. In the case of two planted communities, we require that $D(1 - \mu \,\|\, \frac{1}{2}) > \frac{\log (n/4k)}{2k}$ to make sure that we still detect the two clusters. Although we cannot solve explicitly for $\mu$, this inequality shows that $n$ is bounded above by

$$n < 4k\, e^{2k D(1 - \mu \,\|\, \frac{1}{2})}. \tag{29}$$
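As a rough numerical illustration of this bound, the sketch below evaluates the right-hand side of Eq. (29), read here as $n < 4k\,e^{2kD(1-\mu\|\frac{1}{2})}$, for a few mixing parameters. The function names `kl` and `max_planted_size`, the use of natural logarithms, and the example value $k = 5$ are assumptions made for illustration only.

```python
import math

def kl(x, y):
    """Binary KL divergence D(x || y) in nats, with 0 log 0 taken as 0."""
    total = 0.0
    for a, b in ((x, y), (1 - x, 1 - y)):
        if a > 0:
            total += a * math.log(a / b)
    return total

def max_planted_size(k, mu):
    """Right-hand side of Eq. (29): 4k * exp(2k * D(1 - mu || 1/2)).

    Hypothetical helper; k is the quantity for which m = n * k.
    """
    return 4 * k * math.exp(2 * k * kl(1 - mu, 0.5))

# The bound shrinks quickly as the mixing parameter mu approaches 1/2.
for mu in (0.1, 0.2, 0.3, 0.4):
    print(f"mu = {mu:.1f}: n must stay below about {max_planted_size(5, mu):.0f}")
```

The bound grows exponentially in $k$, so denser graphs tolerate much larger planted communities before a perfect matching becomes competitive.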

If $n$ grows large, there is likely some structure arising from random fluctuations within the planted communities. Notice that there are likely better partitions than a perfect matching. We can therefore expect the actual critical $n$ for which the planted partition is no longer optimal to be lower.

We can similarly derive such thresholds for ER modularity. For a perfect matching the ER modularity is $Q_{rnd}(\frac{n}{2}) = \frac{1}{2k} - \frac{2}{n}$. Then solving $Q_{plt} < Q_{rnd}(\frac{n}{2})$ gives us an estimate of when ER modularity is likely to find an alternative partition (i.e. a perfect matching in this case). The critical $\mu^*$ can in this case be explicitly derived and yields $\mu^* = \frac{1}{2}\left(1 - \frac{1}{k} + \frac{4}{n}\right)$. However, the detectability threshold is already reached before that point at $\mu^* = \frac{1}{2}\left(1 - \frac{1}{\sqrt{k}}\right)$, leaving $n$ essentially unbounded.

Again, there will be better partitions than a perfect matching, so that $n$ may still be bounded to some extent. Nonetheless, this shows that ER modularity is less affected by the size of the communities than surprise, and is less likely to find substructure within the planted communities.

In summary then, surprise does not tend to suffer from the resolution limit, but does quickly find substructure due to random fluctuations. ER modularity on the other hand suffers from a resolution limit, but tends to ignore substructure in communities. Stated differently, for a planted partition model with $r$ communities and $n = r n_c$ nodes, surprise and ER modularity work well in different limits. Whenever $r \to \infty$ with $n_c$ fixed, surprise works well but ER modularity works poorly. Whenever $r$ is fixed but $n_c \to \infty$, ER modularity works well, but surprise works poorly. An interesting question would concern which method would work well for both limits.


V. EXPERIMENTAL RESULTS 

We here confirm our theoretical results experimentally. We first show numerically that the asymptotic formulation of surprise provides an excellent approximation. Secondly, we validate the inequalities between surprise, significance and ER modularity. Thirdly, we show the different limitations on surprise and modularity. Finally, we demonstrate that the asymptotic formulation of surprise performs very well in LFR benchmarks [44].

For comparing the asymptotic formulation with the exact hypergeometrical and binomial formulation, we used regular rooted trees with three children. To create such trees, we first create the root node, and add three children to this root node. We then keep on adding children to the leaves of the tree until we obtain the desired number of nodes. We use trees to minimize the number of edges to prevent numerical problems with the hypergeometrical and binomial formulation. Using relatively large numbers results in numerical issues, preventing a comparison to the asymptotic formulation. We optimize asymptotic surprise using the Louvain algorithm to find a partition on this graph. As can be seen in Fig. 3, the approximation is quite good, and the approximation ratio tends to 1. Notice that the number of nodes in these graphs is limited to 200, whereas complex networks are usually much larger. Hence, we expect the approximation to be accurate for any real network.
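The comparison described above can be sketched in a few lines. The block below is a minimal illustration, not the code used for Fig. 3: it assumes the usual definition of surprise as the negative logarithm of a hypergeometric tail, evaluates it in log space to sidestep the numerical issues mentioned above, and compares it with $mD(q \,\|\, \langle q \rangle)$. The function names and the example partition (40 communities of 5 nodes in a 200-node tree, with every community internally connected) are hypothetical.

```python
import math

def log_binom(n, k):
    """Logarithm of the binomial coefficient C(n, k), via log-gamma."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)

def kl(x, y):
    """Binary KL divergence D(x || y) in nats, with 0 log 0 taken as 0."""
    total = 0.0
    for a, b in ((x, y), (1 - x, 1 - y)):
        if a > 0:
            total += a * math.log(a / b)
    return total

def exact_surprise(M, M_int, m, m_int):
    """-log P(X >= m_int) for a hypergeometric X (assumed definition)."""
    start = max(m_int, m - (M - M_int))   # keep both binomials well defined
    stop = min(m, M_int)
    log_terms = [
        log_binom(M_int, i) + log_binom(M - M_int, m - i) - log_binom(M, m)
        for i in range(start, stop + 1)
    ]
    peak = max(log_terms)                 # log-sum-exp for numerical stability
    return -(peak + math.log(sum(math.exp(t - peak) for t in log_terms)))

def asymptotic_surprise(M, M_int, m, m_int):
    """Asymptotic surprise m * D(q || <q>) with q = m_int/m, <q> = M_int/M."""
    return m * kl(m_int / m, M_int / M)

# Hypothetical partition of a 200-node tree (m = 199 edges, M = C(200, 2)).
M, M_int, m, m_int = 19900, 400, 199, 160
ratio = asymptotic_surprise(M, M_int, m, m_int) / exact_surprise(M, M_int, m, m_int)
print(f"approximation ratio: {ratio:.3f}")
```

Working with logarithms of the individual terms is one way to avoid the overflow that otherwise limits the exact formulation to small graphs.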

To demonstrate the limitations on surprise and (ER) modularity we create some test networks with a planted partition. We generate networks with average degree $\langle k \rangle = 10$ and set $\mu = 0.1$. In the first test, we create networks with fixed community sizes $n_c = 10$ and vary the number of communities $r$. In the second test, we fix the number of communities to 2 but vary the community size $n_c$ from 10 to 500. We consider whether the planted partition remains optimal by comparing the quality of the planted partition $\mathcal{S}_{plt}$ (or $Q_{plt}$ for modularity) with that of the partition found through optimization $\mathcal{S}$ (or $Q$ for modularity). Whenever $\mathcal{S}_{plt} < \mathcal{S}$ we thus know that the planted partition is no longer optimal. The results shown in Fig. 2 clearly confirm our theoretical analysis. In the case where $r \to \infty$ with fixed $n_c$, surprise does well, whereas (ER) modularity suffers from the resolution limit. In the case that $r$ is fixed to $r = 2$, but $n_c \to \infty$, surprise does less well, as it tends to find subgraphs within the two large communities. Modularity also has problems identifying the optimal bisection. Indeed, the uncovered partitions do not coincide exactly with the planted partition, even though the modularity value remains rather similar. Such partitions are likely to occur because of the degeneracy of modularity [20]. Nonetheless, our results show that the modularity of the planted partition remains (nearly) optimal, whereas surprise for the planted partition clearly diminishes compared to surprise of the uncovered partitions.

FIG. 3. (Color online) Approximation of surprise. The asymptotic formulation of surprise, using the KL divergence, approximates well both the binomial and the hypergeometric surprise. The inset shows the approximation ratio $\mathcal{S}_{asym}/\mathcal{S}_{hyper}$ and $\mathcal{S}_{asym}/\mathcal{S}_{binom}$, both going to 1 for large graphs.

We also tested the various methods more extensively using benchmark graphs with a more realistic community size and degree distribution [44]. We set the average degree $\langle k \rangle = 20$ while the maximum degree is 50 and follows a powerlaw degree distribution with exponent 2. Planted community sizes range from 10 to 50 for the “small” communities, and from 20 to 100 for “large” communities. The planted community sizes are also distributed according to a powerlaw, but with an exponent of 1. The parameter $\mu$ again controls the probability of internal links.

In Fig. 4 we show the function values for surprise, significance and ER modularity. This clearly shows that the inequalities hold over the whole range of mixing parameters. At the same time, they show very similar behavior to each other. Although this could indicate a relatively similar performance, we next show this is not the case.

FIG. 4. (Color online) Inequalities. In most cases significance is more discriminative than surprise, which is more discriminative than the ER modularity, so that $\mathcal{Z} > \mathcal{S} > Q_{ER}$. These inequalities clearly hold over the whole range of the mixing parameter $\mu$ for LFR benchmarks ($n = 10^4$). For ER modularity we display $2mQ_{ER}^2$ as used in Eq. (14).

In Fig. 5 we show the benchmark results for the four different methods. Surprise and significance perform very well, and clearly much better than both modularity models. Notice that surprise and ER modularity use the same global quantities. However, the use of the KL divergence gives the former a clear advantage, as expected from Eq. (14).

LFR benchmark graphs have a clearer community structure for larger graphs. The critical mixing parameter at which the inner community density equals the outer community density is roughly $\mu \approx 1 - \frac{n_c}{n}$ so that with growing $n$ this threshold goes to 1. Both surprise and significance start to work better for somewhat larger graphs, consistent with the clearer community structure. This is in a sense the opposite of both ER and CM modularity. Their performance is worse for larger graphs, consistent with our earlier analysis of the limitations of community detection.

VI. CONCLUSION

Community detection is an important topic in the field of complex networks, as it can give us a better understanding of real-world networks. Here we analyzed a recent measure known as surprise. We developed an accurate asymptotic approximation, based on the KL divergence, which we use to develop a competitive new algorithm. Applying this algorithm to standard benchmarks, we show its great potential. Significance, another quality measure also based on the KL divergence, performs similarly to surprise.

We showed analytically that surprise is more discriminative than modularity with an ER null model. This is mainly due to the use of the KL divergence to quantify the difference between the empirical partition and the null model. The larger the network and the smaller the communities, the better KL methods perform with respect to modularity. Indeed, whereas modularity suffers from the resolution limit, this problem (nearly) does not affect surprise. On the other hand, surprise tends to find substructure in larger communities, arising from random fluctuations, whereas this problem appears less prominent for modularity. In short, modularity tends to work well in the limit of community sizes $n_c \to \infty$ keeping the number of communities $r$ fixed. Surprise on the other hand works well when $r \to \infty$ keeping the community sizes $n_c$ fixed. Stated differently, modularity tends to underestimate the number of communities, whereas surprise tends to overestimate the number of communities. The question of which method works well in both limits deserves further study.

The slight differences between surprise and significance stem from what each measure ignores. Significance relies on the fraction of edges that are present within a community. It thus implicitly considers missing edges within communities, because this fraction is relative to the total number of possible edges within that community, which surprise does not. Surprise, on the other hand, considers the fraction of total edges that fall within communities. It thus implicitly considers edges that fall between communities, whereas significance does not. Indeed, it should be possible to address these shortcomings by also explicitly examining missing links (for surprise) or links between communities (for significance).

Another shortcoming is that surprise does not depend on the actual distribution of the internal edges among communities. One way to address this issue is to consider edges for all communities separately, by using a multivariate hypergeometric distribution. In that case, we would be interested in the probability to observe $m_{cd}$ edges between communities $c$ and $d$ as

$$\Pr(X_{cd} = m_{cd}) = \frac{\prod_{cd} \binom{M_{cd}}{m_{cd}}}{\binom{M}{m}}. \tag{30}$$


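Again deriving an asymptotic expression, we arrive at

$$\mathcal{S}(\mathcal{V}) = m \sum_{cd} q_{cd} \log \frac{q_{cd}}{\langle q_{cd} \rangle} = mD(q \,\|\, \langle q \rangle) \tag{31}$$

where $q_{cd} = \frac{m_{cd}}{m}$ is the fraction of edges between communities $c$ and $d$ and $\langle q_{cd} \rangle$ the expected value.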
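To make the block-wise expression concrete, the sketch below evaluates $m \sum_{cd} q_{cd} \log (q_{cd} / \langle q_{cd} \rangle)$ from per-block edge counts. It assumes that the expected fraction is $\langle q_{cd} \rangle = M_{cd}/M$, with $M_{cd}$ the number of possible edges between $c$ and $d$; that choice, the function name `block_surprise`, and the example counts are illustrative assumptions.

```python
import math

def block_surprise(edge_counts, pair_counts):
    """Evaluate m * sum_cd q_cd * log(q_cd / <q_cd>).

    edge_counts[(c, d)] holds m_cd, the observed edges between c and d.
    pair_counts[(c, d)] holds M_cd, the possible edges between c and d
    (assumed here to define the expected fraction <q_cd> = M_cd / M).
    """
    m = sum(edge_counts.values())
    M = sum(pair_counts.values())
    total = 0.0
    for block, m_cd in edge_counts.items():
        if m_cd == 0:
            continue                      # 0 log 0 is taken as 0
        q = m_cd / m
        q_expected = pair_counts[block] / M
        total += q * math.log(q / q_expected)
    return m * total

# Two communities of 10 nodes each: 45 possible edges inside each, 100 between.
pairs = {(0, 0): 45, (0, 1): 100, (1, 1): 45}
assortative = {(0, 0): 30, (0, 1): 10, (1, 1): 30}
print(block_surprise(assortative, pairs))
```

Because the sum runs over all blocks, a strongly disassortative pattern (most edges between the two groups) also scores highly, which illustrates why the measure does not single out community structure.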

Interestingly, the extension of surprise in Eq. (30) is identical to a stochastic blockmodel (using an ER null model) [45, 46]. However, Karrer and Newman found that this method did not work satisfactorily [45]. This might be because the measure does not focus on communities specifically, but rather on all types of block structures. Hence, there is no reason why a community structure should maximize this likelihood, rather than any other type of block structure. One possible way to address this is to compare our partition to the ideal type we are looking for, rather than maximizing the difference to a random null model. This would be an interesting avenue to consider in future research.

FIG. 5. (Color online) Benchmark results. The first row shows results for “small” communities, which range from 10-50, while the second row contains results for “large” communities, ranging from 20-100. The community sizes are powerlaw distributed with exponent 1. We set the average degree $\langle k \rangle = 20$ and the maximum degree is 50, which follows a powerlaw degree distribution with exponent 2. Both surprise and significance perform very well, especially for relatively large graphs, where ER and CM modularity fail. This difference is more notable for smaller communities, for which both ER and CM modularity have difficulties. This is in part due to the well-known resolution limit and in line with our earlier analysis.
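Appendix A: Asymptotic surprise

As stated in the main text, $q$ denotes the fraction of internal edges, so that we can write $m_{int} = qm$. Since $m = p\binom{n}{2} = pM$, we thus have $m_{int} = qpM$. Similarly, we can write $M_{int} = \langle q \rangle M$. Hence, we obtain

$$m = pM, \tag{A1}$$

$$m_{int} = qpM, \tag{A2}$$

$$M_{int} = \langle q \rangle M. \tag{A3}$$

Notice that all quantities now depend on $M$. We only take into account the dominant term, to obtain

$$\mathcal{S}(\mathcal{V}) = -\log \frac{\binom{\langle q \rangle M}{qpM} \binom{(1 - \langle q \rangle)M}{p(1 - q)M}}{\binom{M}{pM}} \tag{A4}$$

which corresponds to the probability of observing exactly $m_{int}$ internal links. The binomial coefficient $\binom{M}{pM}$ is independent of the partition, so we ignore it. We use Stirling's approximation of the binomial coefficient which reads

$$\binom{n}{k} \approx \left(\frac{n}{k}\right)^{k}. \tag{A5}$$

Hence, for the dominant term, we obtain

$$\mathcal{S}(\mathcal{V}) \approx -\log \left[ \left(\frac{\langle q \rangle M}{qpM}\right)^{qpM} \left(\frac{(1 - \langle q \rangle)M}{p(1 - q)M}\right)^{p(1 - q)M} \right] \tag{A6}$$

$$= -\log \left[ \left(\frac{\langle q \rangle}{q}\right)^{qpM} \left(\frac{1 - \langle q \rangle}{1 - q}\right)^{p(1 - q)M} p^{-pM} \right]. \tag{A7}$$

The term $p^{-pM}$ is independent of the partition and we ignore it, which yields

$$\mathcal{S}(\mathcal{V}) = -pM \left( q \log \frac{\langle q \rangle}{q} + (1 - q) \log \frac{1 - \langle q \rangle}{1 - q} \right). \tag{A8}$$

Using $pM = m$, we can rewrite this to

$$\mathcal{S}(\mathcal{V}) = mD(q \,\|\, \langle q \rangle) \tag{A9}$$

where $D(x \,\|\, y)$ is the KL divergence [31]

$$D(x \,\|\, y) = x \log \frac{x}{y} + (1 - x) \log \frac{1 - x}{1 - y} \tag{A10}$$

which can be interpreted as the distance between the two probability distributions $q$ and $\langle q \rangle$.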



Appendix B: Significance 

We can calculate the approximate difference of moving an edge from one community to another. Assume we move an edge from community $r$ to community $s$. The change in the density will be approximately $p_r - \frac{1}{n_r^2}$ and $p_s + \frac{1}{n_s^2}$ respectively. The corresponding difference in significance will be approximately

$$\mathcal{Z}(\mathcal{V}') - \mathcal{Z}(\mathcal{V}) \approx n_s^2 \left( D(p_s + \tfrac{1}{n_s^2} \,\|\, p) - D(p_s \,\|\, p) \right) + n_r^2 \left( D(p_r - \tfrac{1}{n_r^2} \,\|\, p) - D(p_r \,\|\, p) \right) \tag{B1}$$

$$\approx \frac{\partial}{\partial p_s} D(p_s \,\|\, p) - \frac{\partial}{\partial p_r} D(p_r \,\|\, p) \tag{B2}$$

$$= \log \frac{p_s (1 - p_r)}{(1 - p_s)\, p_r} = \Delta z. \tag{B3}$$

This quantity is particularly straightforward (the logarithmic odds ratio), and if $p_r > p_s$ the difference will be negative, and if $p_r < p_s$ this quantity will be positive. Moving edges from a denser community to a less dense community decreases the significance. In other words, making two densities more equal decreases the significance. Repeating these steps, we should expect to find the lowest significance when the communities are of equal density.
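The approximation in Eqs. (B1)-(B3) can be checked numerically. The sketch below compares the finite difference of Eq. (B1) with the logarithmic odds ratio of Eq. (B3); the function names and the example sizes and densities are hypothetical, and natural logarithms are assumed.

```python
import math

def kl(x, y):
    """Binary KL divergence D(x || y) in nats, with 0 log 0 taken as 0."""
    total = 0.0
    for a, b in ((x, y), (1 - x, 1 - y)):
        if a > 0:
            total += a * math.log(a / b)
    return total

def significance_change(n_r, n_s, p_r, p_s, p):
    """Finite difference of Eq. (B1) for moving one edge from r to s."""
    gain = n_s**2 * (kl(p_s + 1 / n_s**2, p) - kl(p_s, p))
    loss = n_r**2 * (kl(p_r - 1 / n_r**2, p) - kl(p_r, p))
    return gain + loss

def log_odds_ratio(p_r, p_s):
    """Closed form of Eq. (B3)."""
    return math.log(p_s * (1 - p_r) / ((1 - p_s) * p_r))

# Moving an edge from the denser community r to the sparser community s.
print(significance_change(1000, 1000, 0.5, 0.2, 0.05))
print(log_odds_ratio(0.5, 0.2))
```

Note that the overall density $p$ drops out of the closed form, since it contributes equally to both derivatives.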
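Alternatively, by convexity of the Kullback-Leibler divergence, we obtain for significance that

$$\mathcal{Z}(\mathcal{V}) \geq M_{int} D\left( \frac{\sum_c \binom{n_c}{2} p_c}{M_{int}} \,\Big\|\, p \right). \tag{B4}$$

Realizing that $m_c = p_c \binom{n_c}{2}$, we see that

$$\frac{\sum_c \binom{n_c}{2} p_c}{M_{int}} = \frac{m_{int}}{M_{int}}. \tag{B5}$$

Notice that this can be interpreted as an average internal density $\langle p_c \rangle$ as stated in the main text. Using this we arrive at

$$\mathcal{Z}(\mathcal{V}) \geq M_{int} D\left( p\frac{q}{\langle q \rangle} \,\Big\|\, p \right). \tag{B6}$$

Hence, the significance of a partition with different community densities $p_c$ is generally larger than that of a partition where all communities have the same average density $p_c = \frac{m_{int}}{M_{int}}$. Notice that $p\frac{q}{\langle q \rangle}$ should be bounded by 1 so that $q > \langle q \rangle > pq$ in general.

This points to a bound such that $\mathcal{Z}(\mathcal{V}) \geq \mathcal{S}(\mathcal{V})$ when $\langle q \rangle < p$ in the following way. Define $\lambda = \frac{\langle q \rangle}{p}$ so that $\lambda < 1$ if $\langle q \rangle < p$. Again applying convexity, we obtain

$$\mathcal{Z}(\mathcal{V}) \geq M_{int} D\left( p\frac{q}{\langle q \rangle} \,\Big\|\, p \right) \tag{B7}$$

$$= \frac{M_{int}}{\lambda} \left( \lambda D\left( p\frac{q}{\langle q \rangle} \,\Big\|\, p \right) + (1 - \lambda) D(0 \,\|\, 0) \right) \tag{B8}$$

$$\geq \frac{M_{int}}{\lambda} D\left( \lambda p\frac{q}{\langle q \rangle} \,\Big\|\, \lambda p \right) \tag{B9}$$

$$= mD(q \,\|\, \langle q \rangle) = \mathcal{S}(\mathcal{V}). \tag{B10}$$

If there are fewer communities (i.e. if $p < \langle q \rangle$) the relationship is not entirely clear, but there are cases for which surprise may be larger than significance. For example, if we assume an equi-sized equi-dense partition with $r$ communities, then $q = p_c \frac{\langle q \rangle}{p}$, and $\langle q \rangle = \frac{1}{r}$, and the difference can be written as

$$\mathcal{S}(\mathcal{V}) - \mathcal{Z}(\mathcal{V}) = m(1 - q) \log \frac{1 - q}{1 - \langle q \rangle} - M_{int}(1 - p_c) \log \frac{1 - p_c}{1 - p}. \tag{B11}$$

Indeed if $\langle q \rangle > p$ then $\mathcal{S}(\mathcal{V}) > \mathcal{Z}(\mathcal{V})$ for equi-sized equi-dense partitions. Keep in mind though that an equi-sized equally dense partition will have a lower significance in general, so that this does not hold for $\langle q \rangle > p$ in general.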
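The bound $\mathcal{Z}(\mathcal{V}) \geq \mathcal{S}(\mathcal{V})$ for $\langle q \rangle < p$ can be illustrated on a small hypothetical example. The sketch below assumes $M = \binom{n}{2}$, $M_{int} = \sum_c \binom{n_c}{2}$, significance $\mathcal{Z} = \sum_c \binom{n_c}{2} D(p_c \,\|\, p)$ and asymptotic surprise $\mathcal{S} = mD(q \,\|\, \langle q \rangle)$; the function name and the example partition (20 communities of 5 nodes with unequal densities) are illustrative only.

```python
import math

def kl(x, y):
    """Binary KL divergence D(x || y) in nats, with 0 log 0 taken as 0."""
    total = 0.0
    for a, b in ((x, y), (1 - x, 1 - y)):
        if a > 0:
            total += a * math.log(a / b)
    return total

def surprise_and_significance(sizes, internal_edges, m):
    """Return (S, Z, <q>, p) for community sizes n_c and internal edges m_c."""
    n = sum(sizes)
    M = math.comb(n, 2)                               # possible edges overall
    pairs = [math.comb(n_c, 2) for n_c in sizes]      # possible edges per community
    M_int, m_int = sum(pairs), sum(internal_edges)
    p = m / M                                         # overall density
    q, q_expected = m_int / m, M_int / M
    surprise = m * kl(q, q_expected)
    significance = sum(
        P * kl(m_c / P, p) for P, m_c in zip(pairs, internal_edges) if P > 0
    )
    return surprise, significance, q_expected, p

# 20 communities of 5 nodes; half have 8 internal edges, half have 6; m = 400.
S, Z, q_expected, p = surprise_and_significance([5] * 20, [8] * 10 + [6] * 10, 400)
print(f"<q> = {q_expected:.4f} < p = {p:.4f}: Z = {Z:.1f}, S = {S:.1f}")
```

Here $\langle q \rangle < p$, so the partition falls in the regime covered by Eqs. (B7)-(B10) and the significance exceeds the surprise.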



